Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval
نویسندگان
چکیده
We describe indexing and retrieval techniques that are suited to perform terabyte-scale information retrieval tasks on a standard desktop PC. Starting from an Okapi-BM25-based default baseline retrieval function, we explore both sides of the effectiveness spectrum. On one side, we show how term proximity can be integrated into the scoring function in order to improve the search results. On the other side, we show how index pruning can be employed to increase retrieval efficiency – at the cost of reduced retrieval effectiveness. We show that, although index pruning can harm the quality of the search results considerably, according to standard evaluation measures, the actual loss of precision, according to other measures that are more realistic for the given task, is rather small and is in most cases outweighed by the immense efficiency gains that come along with it.
منابع مشابه
The TREC 2005 Terabyte Track
The Terabyte Track explores how retrieval and evaluation techniques can scale to terabyte-sized collections, examining both efficiency and effectiveness issues. TREC 2005 is the second year for the track. The track was introduced as part of TREC 2004, with a single adhoc retrieval task. That year, 17 groups submitted 70 runs in total. This year, the track consisted of three experimental tasks: ...
متن کاملIndex Pruning and Result Reranking: Effects on Ad-Hoc Retrieval and Named Page Finding
We describe experiments conducted for the TREC 2006 Terabyte track. Our experiments are centered around two concepts: Static index pruning (for increased retrieval efficiency) and result reranking (for improved precision). We investigate their effect on retrieval efficiency and effectiveness, paying special attention to the difference between ad-hoc retrieval and named page finding. We show tha...
متن کاملEffective Smoothing for a Terabyte of Text
As part of the TREC 2005 Terabyte track, we conducted a range of experiments investigating the effects of larger collections. Our main findings can be summarized as follows. First, we tested whether our retrieval system scales up to terabyte-scale collections. We found that our retrieval system can handle 25 million documents, although in terms of indexing time we are approaching the limits of ...
متن کاملDublin City University at the TREC 2005 Terabyte Track
For the 2005 Terabyte track in TREC Dublin City University participated in all three tasks: Adhoc, Efficiency and Named Page Finding. Our runs for TREC in all tasks were primarily focussed on the application of “Top Subset Retrieval” to the Terabyte Track. This retrieval utilises different types of sorted inverted indices so that less documents are processed in order to reduce query times, and ...
متن کاملMelbourne University at the 2006 Terabyte Track
This report describes the work done at The University of Melbourne for the TREC2006 Terabyte Track. For this track, we participated in all three main tasks. We continued our work with impact-based ranking and sought to reduce indexing as well as query time. However, to support the named-page task, more conventional retrieval mechanisms were also employed. The results show that, in general, the ...
متن کامل